9 research outputs found

    Automated Knowledge Base Quality Assessment and Validation based on Evolution Analysis

    In recent years, numerous efforts have been put towards sharing Knowledge Bases (KB) in the Linked Open Data (LOD) cloud. These KBs are used for various tasks, including performing data analytics and building question answering systems. Such KBs evolve continuously: their data (instances) and schemas can be updated, extended, revised and refactored. However, unlike in more controlled types of knowledge bases, the evolution of KBs exposed in the LOD cloud is usually unrestrained, which may cause the data to suffer from a variety of quality issues, both at the semantic level and at the pragmatic level. This situation negatively affects data stakeholders such as consumers and curators. Data quality is commonly understood as fitness for use for a certain application or use case. Therefore, ensuring the quality of the data of an evolving knowledge base is vital. Since the data is derived from autonomous, evolving, and increasingly large data providers, manual data curation is impractical and, at the same time, continuous automatic assessment of data quality is very challenging. Ensuring the quality of a KB is a non-trivial task since KBs combine structured information supported by models, ontologies, and vocabularies with queryable endpoints, links, and mappings. Thus, in this thesis, we explored two main areas in assessing KB quality: (i) quality assessment using KB evolution analysis, and (ii) validation using machine learning models. The evolution of a KB can be analyzed using fine-grained “change” detection at a low level or using the “dynamics” of a dataset at a high level. In this thesis, we present a novel knowledge base quality assessment approach using evolution analysis. The proposed approach uses data profiling on consecutive knowledge base releases to compute quality measures that allow detecting quality issues. However, the first step in building the quality assessment approach was to identify the quality characteristics. Using high-level change detection as measurement functions, we present four quality characteristics: Persistency, Historical Persistency, Consistency and Completeness. The Persistency and Historical Persistency measures concern the degree of change and the lifespan of any entity type. The Consistency and Completeness measures identify properties with incomplete information and contradictory facts. The approach has been assessed both quantitatively and qualitatively on a series of releases from two knowledge bases: eleven releases of DBpedia and eight releases of 3cixty Nice. However, high-level changes, being coarse-grained, cannot capture all possible quality issues. In this context, we present a validation strategy whose rationale is twofold: first, manual validation based on qualitative analysis is used to identify the causes of quality issues; then, RDF data profiling information is used to generate integrity constraints. The validation approach relies on the idea of inducing RDF shapes by exploiting SHACL constraint components. In particular, this approach learns which integrity constraints can be applied to a large KB through a statistical analysis followed by a learning model. We illustrate the performance of our validation approach using five learning models over three sub-tasks, namely minimum cardinality, maximum cardinality, and range constraint.
The quality assessment and validation techniques developed during this work are automatic and can be applied to different knowledge bases independently of the domain. Furthermore, the measures are based on simple statistical operations that make the solution both flexible and scalable.
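    To make the measures concrete, here is a minimal Python sketch of the kind of computation involved: entity and property counts profiled from two consecutive releases feed a simplified Persistency check and a Completeness ratio. All counts, type names, and the exact formulas are illustrative assumptions, not the thesis's definitions.

```python
# Minimal sketch of evolution-based quality measures. The counts stand in for
# data-profiling output over two consecutive KB releases; names, numbers, and
# the simplified formulas are illustrative, not the thesis's exact definitions.

entity_counts = {                 # entities per type: (release i-1, release i)
    "dbo:Place": (725_546, 736_000),
    "dbo:Event": (52_000, 48_000),          # count dropped -> potential issue
}

property_counts = {               # (subjects having the property, all typed subjects) in release i
    "dbo:birthDate": (1_100_000, 1_450_000),
    "dbo:deathDate": (480_000, 1_450_000),
}

def persistency(prev_count: int, curr_count: int) -> int:
    """Simplified persistency: 1 if the entity count did not drop between releases."""
    return 1 if curr_count >= prev_count else 0

def completeness(with_property: int, total_subjects: int) -> float:
    """Fraction of typed subjects that carry the property."""
    return with_property / total_subjects if total_subjects else 1.0

for etype, (prev, curr) in entity_counts.items():
    print(f"{etype}: persistency = {persistency(prev, curr)}")

for prop, (have, total) in property_counts.items():
    print(f"{prop}: completeness = {completeness(have, total):.2%}")
```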

    A systematic literature review of open data quality in practice

    Context: The main objective of open data initiatives is to make information freely available through easily accessible mechanisms and to facilitate its exploitation. In practice, openness should be accompanied by a certain level of trustworthiness or guarantees about the quality of the data. Traditional data quality is a thoroughly researched field with several benchmarks and frameworks to grasp its dimensions. However, quality assessment in open data is a complicated process, as it involves stakeholders, the evaluation of datasets, and the publishing platform. Objective: In this work, we aim to identify and synthesize various features of open data quality approaches in practice. We applied thematic synthesis to identify the most relevant research problems and quality assessment methodologies. Method: We undertook a systematic literature review to summarize the state of the art on open data quality. The review process started by developing the review protocol, in which all steps, research questions, inclusion and exclusion criteria, and analysis procedures are specified. The search strategy retrieved 9323 publications from four scientific digital libraries. The selected papers were published between 2005 and 2015. Finally, through a discussion between the authors, 63 papers were included in the final set of selected papers. Results: Open data quality, in general, is a broad concept, and it can apply to multiple areas. There are many quality issues concerning open data that hinder their actual usage for real-world applications. The main ones are unstructured metadata, heterogeneity of data formats, lack of accuracy, incompleteness, and lack of validation techniques. Furthermore, we collected the existing quality methodologies from the selected papers and synthesized them under a unifying classification schema. A list of quality dimensions and metrics from the selected papers is also reported. Conclusion: In this research, we provided an overview of the methods related to open data quality, using the instrument of systematic literature reviews. Open data quality methodologies vary depending on the application domain. Moreover, the majority of studies focus on satisfying specific quality criteria. With metrics based on generalized data attributes, a platform could be created to evaluate all possible open datasets. The lack of methodology validation also remains a major problem, and future studies should focus on validation techniques.

    Mood Classification of Bangla Songs Based on Lyrics

    Music can evoke various emotions, and with the advancement of technology it has become more accessible to people. Bangla music, which portrays different human emotions, lacks sufficient research. The authors of this article aim to analyze Bangla songs and classify their moods based on the lyrics. To achieve this, the research compiled a dataset of 4000 Bangla song lyrics and genres and used Natural Language Processing and the BERT algorithm to analyze the data. Among the 4000 songs, 1513 represent the sad mood, 1362 the romantic mood, 886 happiness, and the remaining 239 are classified as relaxation. By embedding the lyrics of the songs, the authors classified the songs into four moods: Happy, Sad, Romantic, and Relaxed. This research is significant as it enables multi-class classification of songs' moods, making music more relatable to people's emotions. The article presents the automated results for the four moods accurately derived from the song lyrics. Comment: Presented at the International Conference on Inventive Communication and Computational Technologies 202
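    As a rough illustration of the lyric-based setup, the sketch below wires a multilingual BERT checkpoint to a four-label classification head with Hugging Face Transformers. The checkpoint name, label order, and preprocessing are assumptions, and the classification head is untrained here; reproducing the paper would require fine-tuning on the 4000-lyric dataset.

```python
# Illustrative 4-class mood classifier for song lyrics (not the paper's exact setup).
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

LABELS = ["Happy", "Sad", "Romantic", "Relaxed"]       # assumed label order
MODEL = "bert-base-multilingual-cased"                 # assumed checkpoint with Bangla coverage

tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL, num_labels=len(LABELS))

def predict_mood(lyrics: str) -> str:
    """Return the predicted mood label for a lyric snippet (head still untrained here)."""
    inputs = tokenizer(lyrics, truncation=True, max_length=256, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    return LABELS[int(logits.argmax(dim=-1))]

print(predict_mood("আমার সোনার বাংলা, আমি তোমায় ভালোবাসি"))   # example Bangla lyric line
```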

    A Quality Assessment Approach for Evolving Knowledge Bases

    Knowledge bases are nowadays essential components for any task that requires automation with some degree of intelligence. Assessing the quality of a Knowledge Base (KB) is a complex task as it often means measuring the quality of structured information, ontologies and vocabularies, and queryable endpoints. Popular knowledge bases such as DBpedia, YAGO2, and Wikidata have chosen the RDF data model to represent their data due to its capabilities for semantically rich knowledge representation. Despite its advantages, there are challenges in using the RDF data model, for example, data quality assessment and validation. In this paper, we present a novel knowledge base quality assessment approach that relies on evolution analysis. The proposed approach uses data profiling on consecutive knowledge base releases to compute quality measures that allow detecting quality issues. Our quality characteristics are based on KB evolution analysis, and we use high-level change detection for the measurement functions. In particular, we propose four quality characteristics: Persistency, Historical Persistency, Consistency, and Completeness. The Persistency and Historical Persistency measures concern the degree of change and the lifespan of any entity type. The Consistency and Completeness measures identify properties with incomplete information and contradictory facts. The approach has been assessed both quantitatively and qualitatively on a series of releases from two knowledge bases: eleven releases of DBpedia and eight releases of 3cixty. The capability of the Persistency and Consistency characteristics to detect quality issues varies significantly between the two case studies. The Persistency measure gives observational results for evolving KBs and is highly effective in the case of KBs with periodic updates, such as the 3cixty KB. The Completeness characteristic is extremely effective and was able to achieve 95% precision in error detection for both use cases. The measures are based on simple statistical operations that make the solution both flexible and scalable.
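    A small sketch of the kind of consistency check implied above: subjects carrying contradictory facts for a property that profiling suggests is single-valued. The triples and the hard-coded expected cardinality are hypothetical; in the approach itself the expectation would come from profiling the whole KB.

```python
# Illustrative consistency check: flag subjects that carry more than one value
# for a property the profiling step suggests is single-valued (e.g. a birth date).
# The triples are hypothetical; in the approach above the expected cardinality
# would come from data profiling over the whole KB, not be hard-coded.
from collections import Counter

triples = [
    ("dbr:Alice", "dbo:birthDate", "1931-01-01"),
    ("dbr:Bob",   "dbo:birthDate", "1970-05-20"),
    ("dbr:Bob",   "dbo:birthDate", "1971-05-20"),   # contradictory second value
]

EXPECTED_MAX_CARDINALITY = 1                         # assumed, derived from profiling

values_per_subject = Counter(s for s, p, _ in triples if p == "dbo:birthDate")
conflicts = [s for s, n in values_per_subject.items() if n > EXPECTED_MAX_CARDINALITY]

print("subjects with contradictory dbo:birthDate facts:", conflicts)   # ['dbr:Bob']
```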

    Completeness and Consistency Analysis for Evolving Knowledge Bases

    Assessing the quality of an evolving knowledge base is a challenging task as it often requires identifying correct quality assessment procedures. Since the data is often derived from autonomous and increasingly large data sources, it is impractical to manually curate the data, and challenging to continuously and automatically assess its quality. In this paper, we explore two main areas of quality assessment related to evolving knowledge bases: (i) identification of completeness issues using knowledge base evolution analysis, and (ii) identification of consistency issues based on integrity constraints, such as minimum and maximum cardinality, and range constraints. For completeness analysis, we use data profiling information from consecutive knowledge base releases to estimate completeness measures that allow predicting quality issues. Then, we perform consistency checks to validate the results of the completeness analysis using integrity constraints and learning models. The approach has been tested both quantitatively and qualitatively on a subset of datasets from the DBpedia and 3cixty knowledge bases. The performance of the approach is evaluated using precision, recall, and F1 score. From the completeness analysis, we observe 94% precision for the English DBpedia KB and 95% precision for the 3cixty Nice KB. We also assessed the performance of our consistency analysis using five learning models over three sub-tasks, namely minimum cardinality, maximum cardinality, and range constraint. We observed that the best performing model in our experimental setup is the Random Forest, reaching an F1 score greater than 90% for minimum and maximum cardinality and 84% for range constraints. Comment: Accepted for the Journal of Web Semantics
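    To make the learning step concrete, the sketch below trains a scikit-learn Random Forest on fabricated profiling features to predict a binary minimum-cardinality class per property. The features, labels, and toy labeling rule are assumptions and do not reproduce the paper's experiments.

```python
# Illustrative Random Forest for predicting a minimum-cardinality constraint
# class (0 or 1) per property from profiling features. Features and labels are
# fabricated; the real experiments use profiled DBpedia/3cixty data.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)

# Toy profiling features per (class, property) pair:
# [distinct-subjects ratio, mean values per subject, share of subjects with >= 1 value]
X = rng.random((200, 3))
y = (X[:, 2] > 0.9).astype(int)          # toy rule: near-total coverage -> minCount 1

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

print("F1 on toy data:", f1_score(y_test, clf.predict(X_test)))
```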

    Energy Consumption Analysis of Algorithms Implementations

    Context: Mobile devices, typically battery driven, require new efforts to improve the energy efficiency of both hardware and software designs. Goal: The goal of this work is to analyze the energy efficiency of different sorting algorithm implementations. Method: We set up an experiment on an ARM-based device, measuring the energy consumption of different sorting algorithms implemented in different programming languages. Result: The algorithms and languages exhibit significantly different energy consumption; the ARM assembly language implementation of Counting sort is the greenest solution. Conclusion: We provide the basic information needed to select algorithms, and we identified the main factors affecting energy consumption.
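    The comparison can be roughly approximated in software by timing each implementation and multiplying by an assumed average power draw, as in the sketch below. The study itself measured energy on ARM hardware; the 2.5 W figure and the Python implementations here are assumptions for illustration only.

```python
# Rough software-level approximation of the energy comparison: time each sorting
# implementation on the same input and estimate energy as time x assumed power.
import random
import time

def counting_sort(xs, max_value=10_000):
    counts = [0] * (max_value + 1)
    for x in xs:
        counts[x] += 1
    out = []
    for value, n in enumerate(counts):
        out.extend([value] * n)
    return out

data = [random.randrange(10_000) for _ in range(200_000)]
ASSUMED_AVG_POWER_W = 2.5                      # hypothetical board power draw

for name, fn in [("counting sort", counting_sort), ("timsort (built-in)", sorted)]:
    start = time.perf_counter()
    fn(list(data))
    elapsed = time.perf_counter() - start
    print(f"{name}: {elapsed:.3f} s, ~{elapsed * ASSUMED_AVG_POWER_W:.2f} J")
```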

    Energy Consumption Analysis of Image Encoding and Decoding Algorithms

    Context: Energy consumption represents an important issue for limited and embedded devices. Such devices, e.g. smartphones, process many images, both to render the UI and for application-specific purposes. Goal: We aim to evaluate the energy consumption of different image encoding/decoding algorithms. Method: We ran a series of experiments on an ARM-based platform and collected the energy consumed in performing typical image encoding and decoding tasks. Result: We found that there is a significant difference among codecs in terms of energy consumption. Most of the energy consumption relates to the computational efficiency of the algorithm (i.e. the time performance), though the type of processing and the algorithm may affect the average power usage by up to 37%, thus indirectly affecting the energy consumption. Conclusion: JPEG compression is significantly more energy efficient than PNG, both for encoding and decoding. Further studies should focus on the additional features that affect energy consumption beyond computational complexity.
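    A rough software-level analogue of the experiment is to time JPEG and PNG encode/decode round-trips with Pillow and convert time to energy under an assumed power draw, as sketched below. The synthetic image, settings, and the 2.0 W figure are assumptions; the study measured real energy on an ARM platform.

```python
# Rough software sketch of the JPEG-vs-PNG comparison: time encode/decode with
# Pillow and estimate energy via an assumed power draw.
import io
import time
from PIL import Image

img = Image.new("RGB", (1920, 1080), color=(120, 60, 200))   # synthetic test image
ASSUMED_AVG_POWER_W = 2.0                                     # hypothetical power draw

for fmt in ("JPEG", "PNG"):
    buf = io.BytesIO()
    start = time.perf_counter()
    img.save(buf, format=fmt)                                 # encode
    buf.seek(0)
    Image.open(buf).load()                                    # decode fully
    elapsed = time.perf_counter() - start
    print(f"{fmt}: {elapsed * 1000:.1f} ms, ~{elapsed * ASSUMED_AVG_POWER_W * 1000:.1f} mJ")
```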

    RDF Shape Induction using Knowledge Base Profiling

    Knowledge Graphs (KG) are becoming the core of most artificial intelligence and cognitive applications. KGs describe real-world entities and their relationships. Popular knowledge graphs such as DBpedia, YAGO2, and Wikidata have chosen the RDF data model to represent their data due to its capabilities for semantically rich knowledge representation. Despite the advantages, there are challenges in using RDF data, for example, data validation. Ontologies, the most common manner of specifying domain conceptualizations in RDF data, are designed for entailments rather than validation. Most ontologies lack the granular information needed for validating constraints. Recent work on RDF Shapes and the standardization of languages such as SHACL and ShEX provide better mechanisms for representing constraints for RDF data. However, manually creating integrity constraints for large KGs is still a tedious task. This brings a clear need for methods and tools that could help to generate such constraints automatically or semi-automatically. In this paper, we present a data-driven approach for inducing integrity constraints for RDF data using data profiling. Those constraints can be combined into RDF Shapes and used to validate RDF graphs. Our method is based on machine learning techniques to automatically generate RDF shapes using profiled RDF data as features. In the experiments, the proposed approach achieved 97% precision in deriving RDF Shapes with cardinality constraints for a subset of DBpedia data.
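    A small sketch of the profiling step with rdflib: count how many values each dbo:Person has for dbo:birthDate, derive candidate cardinality bounds, and print them as a SHACL-like property shape. The input file, class/property choice, and the fixed 95% threshold are assumptions; the paper's approach feeds such profiles into machine-learning models rather than a hand-written rule.

```python
# Illustrative profiling step for RDF shape induction with rdflib: derive
# candidate minCount/maxCount for dbo:birthDate on dbo:Person instances and
# emit them as a SHACL-like snippet.
from rdflib import Graph, Namespace, RDF

DBO = Namespace("http://dbpedia.org/ontology/")

g = Graph()
g.parse("people.ttl", format="turtle")        # assumed local dump with dbo:Person data

persons = list(g.subjects(RDF.type, DBO.Person))
if not persons:
    raise SystemExit("no dbo:Person instances found in the sample data")

value_counts = [len(list(g.objects(s, DBO.birthDate))) for s in persons]
coverage = sum(1 for n in value_counts if n >= 1) / len(persons)
min_count = 1 if coverage >= 0.95 else 0      # toy rule in place of the learned model
max_count = max(value_counts)

print(f"""ex:PersonShape a sh:NodeShape ;
    sh:targetClass dbo:Person ;
    sh:property [ sh:path dbo:birthDate ;
                  sh:minCount {min_count} ;
                  sh:maxCount {max_count} ] .""")
```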

    LeafNet: A proficient convolutional neural network for detecting seven prominent mango leaf diseases

    Fruit production plays a significant role in meeting nutritional needs and in alleviating the global food crisis. Plant diseases are a common phenomenon that hampers gross production and causes huge losses for farmers in tropical South Asian weather conditions. In this context, early-stage detection of plant disease is essential for healthy production. This research develops LeafNet, a convolutional neural network (CNN)-based approach to detect seven of the most common diseases of mango using images of the leaves. The model is trained specifically on the patterns of mango diseases in Bangladesh using a novel dataset of region-specific images and covers almost all commonly occurring mango diseases. The performance of LeafNet is evaluated with an average accuracy, precision, recall, F-score, and specificity of 98.55%, 99.508%, 99.45%, 99.47%, and 99.878%, respectively, in a 5-fold cross-validation, which is higher than state-of-the-art models such as AlexNet and VGG16. LeafNet can help detect early symptoms of diseases, ultimately leading to higher mango production and contributing to the national economy.
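    For orientation only, below is a compact Keras CNN for seven-class leaf-image classification. This is not the LeafNet architecture from the paper (which is not detailed here); the input resolution, layer sizes, and training settings are assumptions.

```python
# Illustrative 7-class leaf-disease CNN in Keras. This is NOT the LeafNet
# architecture from the paper; layer sizes, input resolution, and training
# settings are assumptions for a minimal, runnable example.
from tensorflow import keras
from tensorflow.keras import layers

NUM_CLASSES = 7                      # seven mango leaf diseases
INPUT_SHAPE = (224, 224, 3)          # assumed input resolution

model = keras.Sequential([
    layers.Input(shape=INPUT_SHAPE),
    layers.Rescaling(1.0 / 255),
    layers.Conv2D(32, 3, activation="relu"), layers.MaxPooling2D(),
    layers.Conv2D(64, 3, activation="relu"), layers.MaxPooling2D(),
    layers.Conv2D(128, 3, activation="relu"), layers.MaxPooling2D(),
    layers.GlobalAveragePooling2D(),
    layers.Dense(128, activation="relu"),
    layers.Dropout(0.3),
    layers.Dense(NUM_CLASSES, activation="softmax"),
])

model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
# Training would use a labelled leaf-image dataset, e.g. (hypothetical path):
# train_ds = keras.utils.image_dataset_from_directory("mango_leaves/", image_size=(224, 224))
# model.fit(train_ds, epochs=10)
```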